Amplifiable protein identification via residue-resolved barcoding and composition code counting

ABSTRACT Ultrasensitive protein identification is of paramount importance in basic research and clinical diagnostics but remains extremely challenging. A key bottleneck in preventing single-molecule protein sequencing is that, unlike the revolutionary nucleic acid sequencing methods that rely on the polymerase chain reaction (PCR) to amplify DNA and RNA molecules, protein molecules cannot be directly amplified. Decoding the proteins via amplification of certain fingerprints rather than the intact protein sequence thus represents an appealing alternative choice to address this formidable challenge. Herein, we report a proof-of-concept method that relies on residue-resolved DNA barcoding and composition code counting for amplifiable protein fingerprinting (AmproCode). In AmproCode, selective types of residues on peptides or proteins are chemically labeled with a DNA barcode, which can be amplified and quantified via quantitative PCR. The operation generates a relative ratio as the residue-resolved ‘composition code’ for each target protein that can be utilized as the fingerprint to determine its identity from the proteome database. We developed a database searching algorithm and applied it to assess the coverage of the whole proteome and secretome via computational simulations, proving the theoretical feasibility of AmproCode. We then designed the residue-specific DNA barcoding and amplification workflow, and identified different synthetic model peptides found in the secretome at as low as the fmol/L level for demonstration. These results build the foundation for an unprecedented amplifiable protein fingerprinting method. We believe that, in the future, AmproCode could ultimately realize single-molecule amplifiable identification of trace complex samples without further purification, and it may open a new avenue in the development of next-generation protein sequencing techniques.


INTRODUCTION
Ultrasensitive protein identification will bring breakthrough technology to the life sciences and propel clinical diagnostics. Although the rapid development of mass spectrometry-based proteomics strategies has greatly facilitated protein identification from complex samples, low-abundance proteins from trace samples often fail to be identified due to the detection limit and dynamic range of mass spectrometry [1,2]. Similar challenges in genomics and transcriptomics have been successfully overcome by using nucleic acid amplification methods. For example, in next-generation sequencing, the polymerase chain reaction (PCR) and other amplification methods have allowed quantitative DNA and RNA analysis from low copy numbers [3], which has revolutionized genomics and transcriptomics research [4]. However, we are still waiting for a similar breakthrough in protein amplification and identification.
Since traditional protein identification methods such as mass spectrometry cannot fill the gap, researchers are making efforts to develop alternative novel protein analysis methods [5,6]. The emerging methods, such as single-molecule Edman degradation [7,8], N-terminal amino-acid-specific binding probes [9], nanopore analysis [10-17], recognition tunneling [18,19], single-molecule mass spectrometry [20,21], DNA-nanotechnology-based protein identification and so on [22-25], can be classified as single-molecule protein identification [26,27]. These methods have great potential in ultrasensitive protein identification, but all of them rely on ultrasensitive single-molecule analytical instruments to detect signals from a single amino acid, peptide or protein.
We envision that an alternative solution to the problem could be amplifiable protein identification. Although there has been no reported natural machinery to directly replicate proteins from a template, converting the protein sequence information into amplifiable barcodes such as DNA might be an approach. Previous 'immuno-PCR' methods have enabled protein detection by using DNA-antibody conjugates with high specificity (by the antibody) and sensitivity (by DNA amplification) [28,29], but only specific target proteins with available antibodies can be detected. With the upsurge in collaborations between organic chemists and chemical biologists in recent years, many residue-specific reactions have been invented that have broadened the toolbox for precise protein engineering, functional modulation and activity-based proteomics [30-34]. Such metal-triggered, photo-triggered, electro-triggered or spontaneous chemical reactions range from electrophilic and nucleophilic substitution to redox reactions [30-34]. Besides the most well-known reactions for lysine (Lys or K) and cysteine (Cys or C), residue-specific chemistry has also been expanded to various types of amino acids including aspartic acid (Asp or D)/glutamic acid (Glu or E), tyrosine (Tyr or Y), arginine (Arg or R), methionine (Met or M), histidine (His or H), tryptophan (Trp or W), serine (Ser or S) and so on [32-38]. Considering that many newly developed protein identification approaches are based on database-matching of protein fingerprints, including the number of amino acids [8,24-27], we believe that using DNA barcodes to record part of the sequence information can allow protein fingerprinting identification in an amplifiable and universal manner [39].
Herein, we report an amplifiable protein fingerprinting method, termed 'AmproCode', by integrating residue-resolved DNA barcoding, quantitative DNA amplification, composition code counting and computer-aided database-matching (Fig. 1a). We reasoned that, through residue-specific chemical reactions, several types of residues can be selectively and quantitatively labeled with DNA barcodes, which are leveraged to magnify the fingerprints of trace samples via quantitative PCR (qPCR) amplification. The residue-resolved composition code is generated by the relative ratio of these amino acids in the target protein as measured by using qPCR.
Since the sequence information of all proteins in the human proteome database can be converted into a composition code library, AmproCode allows matching between the experimentally obtained composition code and the codes in the library using a customized database search algorithm, which may facilitate amplifiable protein identification.

Theoretical justification for AmproCode
We envision that, as proteins vary in their sequences, the relative ratio of their amino acids (the 'composition code') can be employed as a unique fingerprint to facilitate protein identification. Inspired by the previous work on protein fingerprinting, we believed that the composition code of a partial set of residues in a given protein would be sufficient for identification. We first established a composition code library from the whole human proteome database (Swiss-Prot) including 20 588 reviewed peptides and proteins with manual annotations [40]. For each protein, its composition code is numerically represented as the ratio of all the nine residue types that can be modified by specific chemical reactions including Cys, Lys, Asp/Glu, Tyr, Arg, Met, His, Trp and Ser [30-38]. We found that >98% of the proteins in the whole human proteome had unique composition codes consisting of these nine residue types, which suggested that nearly all human proteins could be distinguished by their composition codes in theory (Fig. 1b).
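The construction of a composition code from a sequence can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the reduction of counts to the smallest integer ratio, and the toy sequences are all assumptions, and chain termini (which some of the labeling chemistries also modify) are ignored here.

```python
from collections import Counter
from functools import reduce
from math import gcd

# Residue groups in the order listed in the text; Asp and Glu share one label.
RESIDUE_GROUPS = ("C", "K", "M", "DE", "Y", "R", "H", "W", "S")


def composition_counts(sequence, groups=RESIDUE_GROUPS):
    """Count side chains per residue group (termini are not counted)."""
    counts = Counter(sequence.upper())
    return tuple(sum(counts[aa] for aa in group) for group in groups)


def composition_code(sequence, groups=RESIDUE_GROUPS):
    """Reduce the counts to the smallest integer ratio, e.g. (2, 2, 2, 1)."""
    counts = composition_counts(sequence, groups)
    divisor = reduce(gcd, counts) or 1  # all-zero counts give gcd 0
    return tuple(c // divisor for c in counts)


def unique_fraction(library, groups=RESIDUE_GROUPS):
    """Fraction of entries whose code is shared with no other entry."""
    codes = {name: composition_code(seq, groups) for name, seq in library.items()}
    tally = Counter(codes.values())
    return sum(1 for code in codes.values() if tally[code] == 1) / len(codes)


# Hypothetical toy sequences, not taken from Swiss-Prot.
toy_sequences = {"pepA": "CKMD", "pepB": "CCKKMMDD", "pepC": "CKKMY"}
```

In this toy library pepA and pepB have proportional counts and therefore share one code, so only pepC counts as unique.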
Figure 1. (a) The workflow of amplifiable protein identification through AmproCode. Each type of selected residue on peptide or protein samples is modified by using DNA barcodes, respectively. The residue-resolved composition code is generated by the relative ratio of these residues from each protein, which can be amplified and quantified by using qPCR. The sequence information of all proteins in the human proteome database can be converted into the composition code library. Matching the experimentally obtained composition code with the codes in the database library by using the database search algorithm allows the identification of the sample. (b) Theoretical evaluation of the coverage rate of AmproCode in the whole human proteome. (c) Theoretical evaluation of the coverage rate of AmproCode in the secretome. The residue types in the composition code range from two to nine in the order of Cys, Lys, Met, Asp/Glu, Tyr, Arg, His, Trp and Ser. Targeting the highest single hit and the top five hits in identification are evaluated respectively.
We next aimed to establish a proof-of-concept model to evaluate the feasibility of applying AmproCode for protein identification if a partial set of residues is chemically labeled, amplified by DNA barcodes, and accurately quantified in relative composition. For this purpose, we developed a computational tool to compare the input composition code with the code library from the proteome database and ranked the protein entries in the proteome based on the cosine similarity value. Here we used vectors to record the composition codes of the input data and all the protein entries in a proteome database, and the cosine similarity could be calculated by using the following formula:
$$\text{cosine similarity} = \frac{u \cdot v}{\|u\| \, \|v\|}$$
where u and v represent the vectors of two composition codes. From the computational analysis, we found that, with an increase in residue types in the composition code, the coverage of AmproCode in the whole proteome improved accordingly. When the residue types increased from three (Cys, Lys and Met) to four (adding Asp/Glu), the theoretical coverage of the human proteome could be raised from 34% to 89% in the absence of experimental errors (Fig. 1b). Besides the composition code of Cys, Lys, Met and Asp/Glu, we also estimated five other combinations and all of them could cover >75% of the human proteins (e.g. 81% for the combination of Cys, Lys, Met and Tyr in Fig. S1). Although, in certain cases, some proteins and peptides may have degenerate composition codes that can compromise the coverage, AmproCode could still narrow down the analyte from the whole proteome to a few candidates that could be further verified. Indeed, if the restrictions on protein identification in this model could be relaxed from targeting the highest single hit ('top 1') to the top five hits ('top 5') in the code library, then the proteome coverage rate could be raised to 99% using the composition code of Cys, Lys, Met and Asp/Glu (Fig. 1b). Therefore, a composition code containing four residue types is, in theory, sufficient for protein identification.
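The cosine-similarity ranking described in this section can be illustrated with a short, dependency-free sketch. The library entries and the query vector below are hypothetical values chosen for illustration, and `top_hits` is an assumed helper name rather than part of the published tool.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two composition code vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # define similarity to a zero vector as 0


def top_hits(query, library, n=5):
    """Return the n library entries most similar to the query, best first."""
    scored = [(cosine_similarity(query, code), name) for name, code in library.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(name, score) for score, name in scored[:n]]


# Hypothetical four-residue codes in the order Cys, Lys, Met, Asp/Glu.
toy_library = {
    "entry_A": (2, 2, 2, 1),
    "entry_B": (2, 1, 0, 1),
    "entry_C": (0, 3, 1, 6),
}

# A noisy measurement of entry_A's code; only the ratio matters, not the scale.
hits = top_hits((1.9, 2.1, 2.0, 1.0), toy_library, n=2)
```

Because the cosine depends only on direction, a code measured as absolute counts and the same code normalized to one residue give identical scores, which is why a relative ratio is enough for matching.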
In practical applications, it is often only necessary to analyse a subset of the proteome rather than the whole proteome. For example, the secretome is a critical portion of the whole proteome with great clinical relevance, in which many proteins mediate endocrine communication and regulate fundamental homeostatic processes [41], and may serve as novel biomarkers and/or potential drug targets for clinical diagnosis and treatment. To investigate the compatibility of AmproCode with the secretome, we also established the composition code library of the human secretome including 2675 mature chains or active peptides from 2112 precursor proteins with the annotation of 'secreted' in Swiss-Prot. Considering that some secreted peptides are processed from the same protein, we combined them in the same protein entry in the database search. The computational analysis showed that, with the composition code of four residue types, AmproCode also had the capacity for protein identification in the secretome. For example, the composition code of Cys, Lys, Met and Asp/Glu could cover 94% of the secreted proteins (top 5) while the combination of Cys, Lys, Met and Tyr could cover 90% (top 5) (Fig. 1c and Fig. S1). These computational analysis results thus provided the theoretical basis for our method.

Residue-specific DNA barcoding
Residue-specific chemical reactions laid the ground for AmproCode because DNA barcodes need to be selectively attached to amino acid side chains of the peptide or protein with high specificity and efficiency [42]. Considering the availability, specificity and efficiency of reported residue-specific labeling reactions, we initially tested five residue-specific reactions targeting Cys, Lys, Met, Asp/Glu and Tyr: (i) maleimide reagents to functionalize Cys; (ii) N-hydroxysuccinimide (NHS) esters to functionalize Lys; (iii) redox-activated chemical tagging by oxaziridines on Met; (iv) amide condensation on Asp/Glu with the assistance of (7-azabenzotriazol-1-yloxy)tripyrrolidinophosphonium hexafluorophosphate (PyAOP); and (v) formylbenzene diazonium to modify Tyr. These reactions hold certain advantages including the wide availability of reagents, high reactivity under mild conditions and good specificity to target residues, and they have been successfully applied in mass spectrometry-based proteomic workflows before [43], suggesting their practicability in AmproCode.
In a small-molecule system, each of the five amino acids with the 9-fluorenylmethyloxycarbonyl (Fmoc) group was reacted with the corresponding labeling reagents and the reaction efficiency was measured based on the ultraviolet absorption of the Fmoc group using analytical high-performance liquid chromatography (HPLC). We observed an excellent conversion rate (∼99%) for Cys, Lys, Met and Asp/Glu on the small-molecule level. In comparison, the conversion rate of Tyr was only 92% (Fig. 2a and Figs S2 and S9). Considering that the performance of Tyr was not as good as that of the other four residues and the theoretical justification showed that four residues were sufficient for protein identification, we chose the combination of Cys, Lys, Met and Asp/Glu as the initial composition code and left Tyr as an alternative candidate in AmproCode, especially in certain cases containing the Tyr residue.
We then evaluated the labeling efficiency and selectivity of these reactions on the peptide level. We commercially synthesized a model peptide containing each of the Cys, Lys, Met and Asp/Glu residues. Cys and Lys were modified in a one-pot reaction, followed by Met and Asp modification, respectively. The high efficiency and specificity (>95%) were validated by the ultraviolet absorption using HPLC and mass spectrometry (Figs S3 and S10).
Next, we developed the residue-specific DNA barcoding scheme (Fig. 2b and c, and Fig. S4): (i) Cys residues were directly barcoded by using the maleimide-conjugated DNA reagent (Mal-DNA); (ii) Lys residues were functionalized by using the thioester-modified NHS ester and barcoded by using Mal-DNA in the presence of hydroxylamine (or the azide-modified NHS ester and the dibenzocyclooctyne-conjugated DNA reagent, DBCO-DNA) after protecting Cys by using N-methyl maleimide; (iii) Met residues were functionalized by using the azide-modified oxaziridine and barcoded by using DBCO-DNA after protecting Cys and Lys by using N-methyl maleimide and N-acetoxysuccinimide, respectively; (iv) Asp/Glu residues as well as the C-termini of the protein were labeled using the azide-modified amine and barcoded by using DBCO-DNA after protecting Cys and Lys. We applied the DNA barcoding strategy on the synthetic model secreted hormone peptide, ELA, which functions in cardiovascular development and homeostasis, and can help reduce maternal mortality [44,45]. The peptide contained all four types of labeling sites: two Cys residues, two Lys residues, two Met residues and a C-terminal carboxylate. We divided the peptide sample into four aliquots, each of which was barcoded at one residue type with the designed common DNA barcode. The products including ELA-C-(DNA)2, ELA-K-(DNA)2, ELA-M-(DNA)2 and ELA-C-terminal-DNA were confirmed by using mass spectrometry (Fig. S11), validating our residue-specific DNA barcoding design. Furthermore, we also designed the Tyr-specific peptide barcoding workflow based on the azo coupling reaction so that we could have more combination choices to improve proteome coverage and/or facilitate experimental procedures. The Tyr residues were labeled via a one-pot reaction with the formylbenzene diazonium and the azide-modified hydroxylamine reagents, and the azide group attached to the residues was further modified by using DBCO-DNA (Fig. 2b and d). We validated the workflow on a peptide with a Tyr residue (Fig. S5).
Figure 2 (continued). ... step, the functional groups including azide and thioester are introduced to the targeted residues while some reactive amino acids are protected. In the second barcoding step, these functional groups are further labeled by using Mal-DNA or DBCO-DNA barcode reagents. (d) The peptide with Tyr residue is modified through a two-step scheme. In the first step, the azide group is introduced to the Tyr residues and, in the second step, the azide group is further labeled by using DBCO-DNA.

Amplifiable peptide fingerprinting via AmproCode
After accomplishing the residue-specific DNA barcoding, we proceeded to explore amplifiable protein fingerprinting on the ELA peptide. We first tested the qPCR performance of the DNA-peptide conjugates, including the linear range of detection and the bias toward each labeled residue (Fig. S6). We found that the linear range of detecting the DNA-barcoded peptide was 10^4 copies/μL (10 fmol/L) to 10^8 copies/μL (0.1 nmol/L) by using qPCR with excellent correlation (R^2 > 0.999), suggesting a broad dynamic range (10^5) and low detection limit (10^4 copies/μL). We also found that the DNA barcodes attached to different residues introduced negligible bias to the qPCR.
To quantitatively obtain a count ratio of these chemically labeled amino acids based on DNA barcoding signals, we also needed to incorporate an internal standard for normalization because, according to the barcoding workflow, the model peptide sample is divided into four aliquots that are barcoded separately, and the sample loss after purification and other steps might differ between aliquots. We therefore attached a fluorescent dye (e.g. tetramethylrhodamine, TAMRA) to the N-terminus of the peptides during synthesis so that the four residue-specific DNA-barcoded samples could be calibrated by the N-terminal fluorescence (Fig. 3a). Each of the four residue types of the synthetic TAMRA-ELA was chemically labeled with the DNA barcode and amplified by using qPCR, respectively (Fig. 3b and Fig. S12). Through qPCR of samples at 10^8 copies/μL, the obtained composition code (Cys:Lys:Met:Asp/Glu/C-terminal) of ELA was 0.97:1:0.98:0.49, which was consistent with the theoretical values (1:1:1:0.5). We searched the secretome library with the obtained code and found that ELA ranked first in the top-five list, with the highest similarity value (Fig. 3c).
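As a quick arithmetic check, the measured ELA code can be compared with its theoretical code by cosine similarity. The working below assumes both codes are treated as four-component vectors in the order Cys:Lys:Met:Asp/Glu/C-terminal; it illustrates the calculation only and is not a value reported in the figures.

```latex
\begin{aligned}
u &= (0.97,\ 1,\ 0.98,\ 0.49), \qquad v = (1,\ 1,\ 1,\ 0.5) \\
u \cdot v &= 0.97 + 1 + 0.98 + 0.245 = 3.195 \\
\|u\|^2 &= 0.9409 + 1 + 0.9604 + 0.2401 = 3.1414, \qquad \|v\|^2 = 3.25 \\
\frac{u \cdot v}{\|u\| \, \|v\|} &= \frac{3.195}{\sqrt{3.1414 \times 3.25}} \approx 0.9999
\end{aligned}
```

Deviations of a few percent in individual ratios therefore change the similarity to the correct entry only in the fourth decimal place; whether the entry is still ranked first depends on how close the neighboring codes in the library are.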
After stepwise dilution, we showed that, at a peptide concentration of as low as 10^4 copies/μL (∼0.07 pg/mL), a similar composition code of the ELA peptide could also be obtained as 1.08:1:0.93:0.50, which was also sufficient to identify the ELA peptide in the secretome (Fig. 3c). Since the detection limit of a common ELISA kit for ELA is ∼1-100 pg/mL and the detection limit of typical mass spectrometry is 10^6-10^9 copies [46,47], the sensitivity of AmproCode is ∼10-10 000 times better than those of the two common protein identification methods despite the fact that we have not yet realized single-molecule amplifiable fingerprinting. These experiments showed that AmproCode had the capability for amplifiable protein identification of a trace sample at extremely low concentrations.
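The unit conversions behind these sensitivity figures can be sketched as follows. The function names are illustrative, and the molar mass of 4200 g/mol is a hypothetical round value used only to show the arithmetic; the text does not state the mass of the labeled peptide.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def copies_per_ul_to_molar(copies_per_ul):
    """Convert copies per microlitre to mol/L (1 L = 1e6 uL)."""
    return copies_per_ul * 1e6 / AVOGADRO


def molar_to_pg_per_ml(molar, molar_mass_g_per_mol):
    """Convert mol/L to pg/mL: mol/L * g/mol = g/L, and 1 g/L = 1e9 pg/mL."""
    return molar * molar_mass_g_per_mol * 1e9


low_end = copies_per_ul_to_molar(1e4)    # about 1.7e-14 mol/L, i.e. ~17 fmol/L
high_end = copies_per_ul_to_molar(1e8)   # about 1.7e-10 mol/L, i.e. ~0.17 nmol/L

# Hypothetical molar mass, for illustration only.
mass_conc = molar_to_pg_per_ml(low_end, 4200.0)  # about 0.07 pg/mL
```

Under these assumptions, 10^4 copies/μL works out to roughly 17 fmol/L, the same order of magnitude as the 10 fmol/L quoted for the lower end of the linear range.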
In addition to ELA, we also applied AmproCode to identify another synthetic peptide, URP, which is a potent physiological vasoconstrictor with essential roles in hypertension [48]. After DNA barcoding and qPCR amplification at 10^8 copies/μL, the obtained composition code (Cys:Lys:Met:Asp/Glu/C-terminal) of URP was 2.04:1:0.00:0.95, which was sufficient to identify URP from the secretome (Fig. 3d and e, and Fig. S13).

Aβ peptide identification via AmproCode from sample mixtures
We have shown that AmproCode has the capacity to identify purified peptides. In principle, the ultimate version of AmproCode in the future could operate in a single-molecule mode in which a single protein is a pure substance after diluting into an isolated microenvironment such as a microwell on chips (Fig. S8). Moving in this direction, we attempted to extend the application of the current AmproCode to the analysis of specific target proteins of interest from complex samples after additional isolation or enrichment steps. Aβ peptide was chosen as the detection target because it plays an essential role in Alzheimer's disease (AD) [49] and both soluble Aβ peptide and Aβ plaque are important clinical biomarkers as well as therapeutic targets [50]. More conveniently, SrtAβ, a transpeptidase sortase A variant, has been recently evolved to recognize the LMVGG sequence of the Aβ protein and attach a poly-glycine peptide with functional groups after the LMVGG sequence [51]. We thus aimed to leverage SrtAβ to assist Aβ isolation in a two-step workflow. First, the TAMRA fluorescent dye was conjugated with the Aβ peptide by SrtAβ in a sample mixture, which contained over six protein or peptide components including humanin-like 9, URP, NY-ESO-1 (157-165), Aβ, GGGK(TAMRA), SrtAβ, as well as impurities. Second, the product, Aβ-TAMRA, was then easily isolated from the mixture by using HPLC based on its unique absorption spectrum (Fig. 4a-c and Fig. S7). Moreover, the fluorescence of TAMRA could also be used for sample calibration in the subsequent analysis. Considering that the Aβ peptide contains six Asp/Glu residues but only one Tyr residue, we reasoned that quantitation of this single Tyr would be more straightforward and accurate than the quantitation of six acidic residues. The Tyr barcoding strategy provided us with the choice to replace Asp/Glu with Tyr to facilitate Aβ identification. After DNA barcoding and qPCR amplification of Cys, Lys, Met and Tyr (Fig. S14), the composition code was quantified as 0.00:1:0.34:0.33, which was consistent with the theoretical ratio of 0:1:0.33:0.33. After searching the database with the obtained composition code, the Aβ peptide was found in the top-five candidate list (Fig. 4d). Furthermore, as it was reported that SrtAβ had the capacity of modifying Aβ peptide with the biotin affinity handle in human cerebrospinal fluid (CSF) and blood [51], we thought that combining AmproCode with SrtAβ for Aβ enrichment and identification might pave the way for AmproCode to be implemented in real-world applications in the future (Fig. S7).
Figure 3. (c) qPCR-quantified composition code of ELA is 0.97:1:0.98:0.49 at 10^8 copies/μL (top) and 1.08:1:0.93:0.50 at 10^4 copies/μL (bottom) (average of three independent replicates). After matching these two composition codes with the code library from the secretome, the ELA peptide (protein entry: ELA) can be identified as the top candidate both at 10^8 and 10^4 copies/μL with the highest score. (d) and (e) Identification of the URP peptide through AmproCode using the composition code of Cys, Lys, Met and Asp/Glu. (d) Four residue types of TAMRA-URP including Cys, Lys, Met and Asp/Glu/C-terminal are labeled by using the DNA barcode, respectively. (e) qPCR-quantified composition code of URP is 2.04:1:0.00:0.95 at 10^8 copies/μL (average of three independent replicates). The URP peptide (protein entry: UTS2B) can be identified in the top two candidates with the highest score.

Simulating the error tolerance of AmproCode
Although we have theoretically analysed the coverage of AmproCode and experimentally identified three model peptides (ELA, URP and Aβ) in the secretome database via AmproCode, we acknowledge that experimental variations that were not considered in our previous computational estimations would influence the protein identification rate to various degrees. Thus, we assessed the secretome coverage of our method, defined here as the rate of correctly identified proteins (accuracy), by using computational simulations in terms of two parameters: the types of quantitated residues and the experimental precision, which was defined as the relative standard deviation (RSD) of the quantitated relative ratio between the selected amino acids in the composition code. We first simulated the coverage using two sets of composition codes: (i) Cys, Lys, Met, Asp/Glu; and (ii) Cys, Lys, Met, Tyr. Interestingly, although the theoretical coverage of the code with Asp/Glu was higher when the RSD was equal to 0, the composition code with Tyr performed better than that with Asp/Glu at an RSD of ≥3%, indicating that the code containing Tyr had a higher error tolerance (Fig. 5a). However, the coverage (accuracy) declined quickly with increasing RSD, even with the Tyr code.
Labeling and quantifying more residues would be an effective way to improve the error tolerance of AmproCode. Based on the computational estimation results, when all the nine residues with specific chemical reactions were labeled and quantified, the coverage could be maintained at >96% even when the RSD was loosened to 10% (Fig. 5b). We found that quantitation of one more residue could bring significant improvement. For example, if Tyr was added to the combination of Cys, Lys, Met and Asp/Glu, then the correct identification rate would rise from 44% to 71% with 10% RSD, representing a 1.6-fold improvement (Fig. 5b), and it could match the accuracy of some other newly reported protein fingerprinting identification methods such as Edman fluorosequencing (40%), DNA nanoswitch calipers (76% with a probability of >90%) and FRET-X (39%-91% under different conditions) [8,24,25]. In one particular case, when we added Tyr to the previously obtained composition code of the URP peptide, the new composition code was generated as 2.04:1:0.00:0.95:1.00 (Cys:Lys:Met:Asp/Glu/C-terminal:Tyr) and the cosine similarity values between the measured composition code of URP and those of the other proteins decreased. For example, the similarity value of the No. 3 protein entry 'YQ032' decreased from 0.994 (Fig. 3d) to 0.988 (Fig. 5c), suggesting that Tyr quantitation brought greater distinction from other proteins and more error tolerance in protein identification. Since the adaptability of AmproCode is related to the number of residue types, we would like to apply more alternative chemical reactions to modify and quantify more residues (e.g. kethoxal for Arg barcoding) to further increase its coverage and accuracy in the future.
Figure 4. (d) Identification of Aβ peptide via AmproCode using the composition code of Cys, Lys, Met and Tyr. qPCR-quantified composition code of Aβ is 0.00:1:0.34:0.33 (Cys:Lys/N-terminal:Met:Tyr) at 10^8 copies/μL (average of three independent replicates). After matching this composition code with the code library from the secretome, Aβ (protein entry: A4) can be identified in the top four candidates with the highest score.

CONCLUSION
In summary, we have reported a proof-of-concept study of the amplifiable protein fingerprinting method AmproCode, which could magnify the composition code of proteins via residue-resolved DNA barcoding and amplification. Theoretically, we showed that the composition code of several residues on a peptide or protein of interest was adequate for fingerprinting and identification from a database derived from the whole proteome or the secretome. We also estimated the coverage of our method with different experimental parameters by computational simulations, supporting our method under more realistic conditions. Experimentally, we designed the DNA barcode and attached it to several types of amino acids on peptides, including Cys, Lys, Met, Asp/Glu as well as Tyr residues, via a panel of highly efficient and mutually orthogonal residue-specific reactions. The DNA barcode on peptides was amplified and quantified by using qPCR, which yielded the composition code of the analytes with greatly improved sensitivity. We applied AmproCode to identify three model peptides: ELA, URP and Aβ. Peptide concentrations at the fmol/L level could be detected, illustrating the potential of AmproCode for trace protein identification.
Although our initial trials on the synthetic model peptides were successful, we also acknowledge that labeling inaccuracies on the proteome level could not be avoided due to the inherent challenge of the varying reactivities of residues in complex protein chemical environments. This problem might be overcome to a certain extent by the combination of computational simulation and experimental optimization. Given that the statistical labeling rates and off-target rates of a specific reaction on the proteome level could be evaluated from a large number of experimental and simulated samples, an additional calibration program according to these statistical data could be added to our database-matching algorithm to obtain a more accurate and realistic protein identification result [8,24]. Furthermore, during the AmproCode workflow, the proteins can be denatured for DNA barcoding, which could further reduce the complexity of the chemical environments. We envision that our initial success will draw more attention from organic chemists and protein chemists to optimize old reactions and develop new ones that may target more residues with higher selectivity and efficiency. Taken together, additional experimental efforts as well as computational auxiliary tools could improve AmproCode significantly in terms of coverage, accuracy and adaptability.
The current form of AmproCode is only a simplified model for fingerprinting purified or isolated peptides found in the database. In order to analyse complex biological or clinical samples, several other approaches have to be employed to improve the AmproCode method towards a single-molecule technique. Considering the fmol/L level of sensitivity that we have reached, applying more advanced DNA amplification techniques may pave the way for single-molecule AmproCode (Fig. S8). Digital PCR (dPCR) can greatly improve DNA quantitation accuracy at extremely low concentrations [52]. It was reported that dPCR had the capacity to absolutely quantify samples at a level of 0.1-1 copies/μL with reduced experimental variation compared with real-time qPCR [53]. Multiplex PCR is another potential improvement in residue quantitation because traditional qPCR detects one analyte in one reaction, whereas multiplex PCR allows the quantification of multiple analytes at the same time [54]. We may realize single-protein amplifiable fingerprinting identification in the future with the help of multiplex dPCR. Furthermore, state-of-the-art microfluidic and automated liquid handling techniques are able to isolate diluted samples into microwell chips at the single-molecule level for high-throughput protein identification [55], so additional purification methods could be replaced in AmproCode (Fig. S8). The experimental errors introduced by manual operation during the sample processing workflow could also be reduced.
Taken together, by proteome database-matching for protein identification, the novel concept of amplifiable protein fingerprinting using AmproCode may open a new avenue towards the development of next-generation protein identification and/or sequencing techniques. Our computational and experimental results represent a prototype that could be further developed as a high-throughput method towards single-cell proteomics and the discovery of clinical biomarkers.

Materials
Peptides were synthesized by GenScript and Hangzhou ALLPEPTIDE Biotechnology. DNA oligos were synthesized by Generay and GenScript. The plasmid was synthesized by GENEWIZ. Fluorescent labeling reagents were purchased from Confluore. DNA modification reagents were purchased from Confluore. Residue-specific peptide modification reagents except Met were purchased from Confluore and 9 Ding Chemistry. The Met-specific modification oxaziridine reagent Ox6 was a gift from Prof. Shixian Lin at Zhejiang University. The mobile phases for HPLC including acetonitrile (ACN) and water were purchased from Thermo Fisher Scientific and Wahaha, respectively, and the additives including formic acid, acetic acid, trifluoroacetic acid and triethylamine were purchased from Macklin, J&K Scientific, Energy-Chemical and Thermo Fisher Scientific, respectively. qPCR-related reagents were purchased from YEASEN.

Database search algorithm
For sequence matching with a target database, 'cosine similarity' was chosen as the distance measurement. We implemented a simple Python script using the scikit-learn package to calculate the cosine distances from the query sequence to all the sequences within the targeted database by using scipy.spatial.distance.cosine and collected the top N results [56]. To display the results more clearly in this manuscript, we report the 'cosine similarity', calculated as one minus the cosine distance value. Sequences of the human proteome were obtained from UniProt release 2021_02 [40]. All known secreted peptide and protein sequences were generated according to 'PROPEP' records within the database.
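The search described above can be illustrated with a minimal standard-library sketch. It is not the authors' script: the cosine similarity is computed directly, which for non-zero vectors equals one minus the value returned by scipy.spatial.distance.cosine. The function names, the toy peptide names and sequences are hypothetical, and counting the N-terminal amine with Lys and the C-terminal carboxyl with Asp/Glu is an assumption based on the labeling chemistry described in the figure captions.

```python
import math

# Residue groups of the four-residue composition code (Cys, Lys, Met, Asp/Glu).
RESIDUE_GROUPS = ("C", "K", "M", "DE")


def composition_code(sequence, groups=RESIDUE_GROUPS):
    """Count the residues of each group in a one-letter amino acid sequence.

    Assumption: the N-terminal amine is counted with Lys and the C-terminal
    carboxyl with Asp/Glu, because the labeling chemistry also reacts there.
    """
    code = []
    for group in groups:
        count = sum(sequence.count(residue) for residue in group)
        if group == "K":
            count += 1  # N-terminal amine (assumption)
        elif group == "DE":
            count += 1  # C-terminal carboxyl (assumption)
        code.append(float(count))
    return code


def cosine_similarity(u, v):
    """Cosine similarity, i.e. one minus the cosine distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # undefined for all-zero codes; treated as no match here
    return dot / (norm_u * norm_v)


def search(query_code, library, top_n=5):
    """Return the top_n (name, similarity) pairs, best match first."""
    scored = [(name, cosine_similarity(query_code, code))
              for name, code in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy library; real use would load UniProt sequences instead.
toy_sequences = {"pepA": "ACKMDE", "pepB": "KKKKM", "pepC": "CCDDEE"}
library = {name: composition_code(seq) for name, seq in toy_sequences.items()}
hits = search(composition_code("ACKMDE"), library, top_n=2)
```

Because cosine similarity is insensitive to overall scale, a measured code expressed as a relative ratio can be compared directly with raw residue counts from the database.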

Theoretical evaluation of the coverage rate
To evaluate the coverage of our AmproCode strategy, we iteratively searched the composition code of each sequence against the whole database. If the query sequence was ranked in the top N (N = 1 or 5) results, it was considered as 'being identified'. The proteome coverage was defined as the percentage of successfully identified proteins. The influence of measurement error was also considered. A Gaussian probability distribution Norm(μ, σ²) was assigned to each composition code in the database, where μ was the true value of the composition code and σ was the standard deviation of measurements. Thus, using random numbers drawn from Norm(μ, σ²), we obtained a series of new composition codes that simulated the composition codes with measurement errors. The coverage was then calculated in the same way as for the error-free codes.
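The coverage evaluation can be sketched as follows, under stated assumptions rather than as the authors' implementation. Each code element is perturbed with Gaussian noise whose standard deviation is taken as σ = RSD × μ (an assumption suggested by the RSD values quoted for Figure 5), ties in similarity are resolved in favour of the true entry (an optimistic assumption), and the toy library and all function names are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_of_true_entry(query_code, true_name, library):
    """1-based rank of the true entry; ties favour the true entry."""
    true_score = cosine_similarity(query_code, library[true_name])
    better = sum(
        1
        for name, code in library.items()
        if name != true_name and cosine_similarity(query_code, code) > true_score
    )
    return better + 1


def coverage(library, top_n=1, rsd=0.0, rng=None):
    """Fraction of entries ranked within top_n when queried with a noisy code."""
    rng = rng or random.Random(0)
    identified = 0
    for name, code in library.items():
        # Draw each element from Norm(mu, sigma^2) with sigma = rsd * mu (assumption).
        noisy = [rng.gauss(mu, rsd * mu) for mu in code]
        if rank_of_true_entry(noisy, name, library) <= top_n:
            identified += 1
    return identified / len(library)


def mean_coverage(library, top_n=1, rsd=0.0, repeats=5, seed=0):
    """Average of several independently simulated coverage values."""
    rng = random.Random(seed)
    return sum(coverage(library, top_n, rsd, rng) for _ in range(repeats)) / repeats


# Hypothetical composition codes (C, K, M, D/E counts) for three entries.
toy_library = {"a": [1.0, 2.0, 1.0, 3.0],
               "b": [0.0, 5.0, 1.0, 1.0],
               "c": [2.0, 1.0, 0.0, 5.0]}
exact = coverage(toy_library, top_n=1, rsd=0.0)
noisy_average = mean_coverage(toy_library, top_n=1, rsd=0.10)
```

With rsd set to 0 the query equals the stored code, so every entry with a unique code direction is identified; increasing rsd shows how quickly similar entries start to displace the true one.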

qPCR measurement and data analysis
The qPCR reaction system contained 2 μL of sample, 10 μL of Hieff qPCR SYBR Green Master Mix Low Rox Plus (YEASEN, Cat# 11202ES03), 0.2 μL of the forward primer (10 μM), 0.2 μL of the reverse primer (10 μM) and 7.6 μL of water in a 96-well PCR plate, giving a total volume of 20 μL. qPCR was performed on the Applied Biosystems ViiA 7 system (Thermo Fisher Scientific) with the following program: 94°C for 3 min, followed by 32-40 cycles of 94°C for 10 s, 49°C for 20 s and 72°C for 30 s. In each qPCR measurement, a sample was measured in parallel in three wells on one plate and the mean Ct readout was used to calculate one relative ratio. The final result was the average of three relative ratios from three independent qPCR measurements.
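The text does not state how the mean Ct values were converted into a relative ratio. One common possibility, shown here purely as an assumption, is the ΔCt approach with an amplification efficiency of two-fold per cycle; a standard-curve conversion would be an equally plausible alternative. The function names and all Ct values below are hypothetical.

```python
from statistics import mean


def relative_ratio(ct_target_wells, ct_reference_wells, efficiency=2.0):
    """Relative amount of target versus reference barcode from triplicate wells.

    Assumption: amplification multiplies the template by `efficiency` in each
    cycle, so ratio = efficiency ** (mean Ct of reference - mean Ct of target).
    """
    delta_ct = mean(ct_reference_wells) - mean(ct_target_wells)
    return efficiency ** delta_ct


def final_ratio(measurements):
    """Average the relative ratios of independent qPCR measurements."""
    return mean([relative_ratio(target, reference)
                 for target, reference in measurements])


# Hypothetical Ct readouts: (target wells, reference wells) for three runs.
runs = [
    ([20.0, 20.0, 20.0], [19.0, 19.0, 19.0]),
    ([21.0, 21.0, 21.0], [20.0, 20.0, 20.0]),
    ([20.5, 20.5, 20.5], [19.5, 19.5, 19.5]),
]
ratio = final_ratio(runs)
```

In this toy case the target crosses the threshold one cycle later than the reference in every run, which under the stated assumption corresponds to half as many barcode copies.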

Figure 1. Design and overview of AmproCode. (a) The workflow of amplifiable protein identification through AmproCode. Each type of selected residue on peptide or protein samples is modified with its own DNA barcode. The residue-resolved composition code is generated from the relative ratio of these residues in each protein, which can be amplified and quantified by using qPCR. The sequence information of all proteins in the human proteome database can be converted into the composition code library. Matching the experimentally obtained composition code with the codes in the database library by using the database search algorithm allows the identification of the sample. (b) Theoretical evaluation of the coverage rate of AmproCode in the whole human proteome. (c) Theoretical evaluation of the coverage rate of AmproCode in the secretome. The residue types in the composition code range from two to nine in the order of Cys, Lys, Met, Asp/Glu, Tyr, Arg, His, Trp and Ser. Identification as the single highest hit and within the top five hits is evaluated separately.

Figure 2. Residue-resolved DNA barcoding on Cys, Lys, Met and Asp/Glu via residue-specific reactions. (a) The scheme and conditions for residue-specific reactions on Cys, Lys, Met, Asp/Glu (as well as C-terminal) and Tyr. (b)-(d) Scheme of residue-specific DNA barcoding on peptides. (b) The DNA barcoding scheme is based on the residue-specific reactions of Cys, Lys, Met, Asp/Glu and Tyr. (c) Cys is directly labeled by methylmaleimide-modified DNA barcodes, while Lys, Met and Asp/Glu residues are each modified through a two-step, residue-specific functionalization and barcoding procedure. In the first

Figure 3. Amplifiable peptide fingerprinting via AmproCode using the composition code of Cys, Lys, Met and Asp/Glu. (a) Schematic illustration of amplifiable peptide fingerprinting through AmproCode. After normalizing the four DNA-barcoded samples using fluorescence, the composition code of Cys, Lys, Met and Asp/Glu of the target protein can be generated by using quantitative PCR amplification, which can be used as the fingerprint to match the composition code library via a database search algorithm. (b) and (c) Identification of the ELA peptide through AmproCode using the composition

Figure 4. AmproCode for Aβ detection from complex mixtures. (a) Transpeptidase SrtAβ-mediated Aβ modification for AmproCode. In a complex sample, SrtAβ recognizes the LMVGG sequence in the Aβ peptide and conjugates GGG-TAMRA to it. The product Aβ-TAMRA can be applied in the AmproCode workflow. (b) HPLC analysis of the sample mixture after labeling Aβ with TAMRA. The absorption is monitored at 200 and 353 nm (TAMRA-specific absorption). (c) Mass spectrum of Aβ-TAMRA. The theoretical molecular weight (MW) is 4786 Da and the observed MW is 4787 Da. (d) Identification of the Aβ peptide via AmproCode using the composition code of Cys, Lys, Met and Tyr. The qPCR-quantified composition code of Aβ is 0.00:1:0.34:0.33 (Cys:Lys/N-terminal:Met:Tyr) at 10⁸ copies/μL (average of three independent replicates). After matching this composition code with the code library from the secretome, Aβ (protein entry: A4) can be identified in the top four candidates with the highest score.

Figure 5. Error tolerance of AmproCode. (a) Simulations of the percentage of correctly identified proteins (coverage) in the secretome via AmproCode using the composition code of Cys, Lys, Met and Asp/Glu (C,K,M,D/E) or Cys, Lys, Met and Tyr (C,K,M,Y) with different experimental precision (0.05%-10% RSD). The coverage represents a statistical average of five independently simulated coverage values. (b) Simulations of the correctly identified protein rates (coverage) of AmproCode in the secretome with different experimental precision (0.05%-10% RSD) and different numbers of residue types (four, five, six and nine in the order of Cys, Lys, Met, Asp/Glu, Tyr, Arg, His, Trp and Ser). The coverage represents a statistical average of five independently simulated coverage values. (c) The qPCR-quantified composition code with Tyr of URP is 2.04:1:0.00:0.95:1.00 at 10⁸ copies/μL (average of three independent replicates). The URP peptide (protein entry: UTS2B) can be identified in the top two candidates with the highest score.